feat: capture-aware alloc + workspace pre-alloc (T2.2, T2.3)#97
Merged
Conversation
…sync

When CaptureAwareAllocator is active (set by BeginCapture/WithCapture), allocWeight routes through cudaMallocAsync on the capture stream so allocations are recorded as graph nodes. This avoids the silent hang caused by cudaMallocManaged during CUDA graph capture on GB10. Similarly, uploadBytes routes through cudaMemcpyAsync on the capture stream instead of the synchronous CPU copy used by the managed-memory path, which is illegal during capture. The ensureNotCapturing guard now fires only when capture is active but the allocator was NOT properly switched via BeginCapture/WithCapture.

Changes:
- Add IsCapturing() to the CaptureAwareAllocator interface
- Implement IsCapturing() on cuda.MemPool and gpuapi.CUDAMemPool
- Add async allocation/copy routing in allocWeight and uploadBytes
- Add function-variable indirections for MallocManaged, MallocAsync, and MemcpyAsync to enable CPU-mock testing
- Add 7 unit tests covering all routing paths
…o avoid capture-time alloc

Add preAllocateWorkspaces(), which eagerly initializes the FP8 scratchpad (scaleOne pointer + struct) and the cuBLASLt handle at the end of UploadWeights, before any CUDA graph capture region begins. These two objects previously used lazy initialization (getFP8Scratch, getLtHandle), which triggered cudaMalloc on first use -- hanging silently on GB10 when first use happened inside capture.

Also add a captureAllocCount atomic counter to track allocWeight attempts during active capture. EndCapture resets the counter and logs a warning if it is non-zero. CaptureAllocCount() exposes the counter for testing.
Summary
Wave 4b of the GB10 CUDA graph capture fix (docs/plan.md E2). This is the core fix that resolves the silent hang described in #93.
- **`allocWeight` routing.** When `CaptureAwareAllocator.IsCapturing()`, `allocWeight` routes through `cudaMallocAsync` on the capture stream (recorded as a graph node) instead of `MallocManaged` (illegal during capture on GB10). Similarly, `uploadBytes` routes through `cudaMemcpyAsync` H2D during capture. Added `IsCapturing()` to the `CaptureAwareAllocator` interface plus implementations. 7 new tests.
- **Workspace pre-allocation.** `preAllocateWorkspaces()`, called at the end of `UploadWeights`, eagerly initializes the FP8 scratchpad and the cuBLASLt handle so no lazy alloc occurs inside capture. Added a `captureAllocCount` atomic counter that instruments capture-time allocs, which should be zero for a properly pre-allocated workload. 7 new tests.

Together with T2.1a (`WithCapture` helper, PR #96) and T4.1 (capture watchdog, PR #96), this completes the E2+E4 fix path. The production hang in #93 is now resolved: callers use `WithCapture` → allocator switches to capture-aware mode → `allocWeight` uses async alloc → no illegal `MallocManaged` during capture → no hang.

Refs #93.
Verification
- `go build ./...`: PASS
- `go test ./compute/... -race -timeout 120s`: PASS (14 new tests, 2.7s)

Test plan

- `go build ./...`
- `go test ./compute/... -race -timeout 120s`